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Abstract 

This paper proposes a statistical generalized species-area model (GSAIVl) to represent various patterns of species-area 
relationship (SAR), which is one of the fundamental patterns in ecology. The approach enables the generalization of many 
preliminary models, as power-curve model, which is commonly used to mathematically describe the SAR. The GSAIVl is 
applied to simulated data set of species diversity in areas of different sizes and a real-world data of insects of Hymenoptera 
order has been modeled. We show that the GSAM enables the identification of the best statistical model and estimates the 
number of species according to the area. 
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Introduction 

The variation in the number of species with area, Icnown as 
species-area relationship (SAR), is one of the most important 
ecological patterns [1]. The models of SAR enable the prediction 
of the number of species that coexist and share resources, as well as 
the impact of the extinction of species caused by habitat loss. 
Sampled data for a single species, or all species of a specific trophic 
level within a particular site have shown that the SAR has a well- 
defined shape, most often described by power and exponential 
curves [2]. The number of species in an area increases with 
increasing island area, but the rate of increase slows for larger 
islands. Many hypotheses have been proposed to explain the SAR 
[3,4,6]. For instance, some are based on the immigration and 
extinction of species [4], random sampling processes [6] or the 
Second Law of Thermodynamics [5] . 

These different hypotheses have generated many mathematical 
models for the description of the SAR [1-3,7-10]. The early 
models were based on deterministic modeling, which assumes that 
every set of variable states is uniquely determined by the 
parameters in the model. For instance, Arrhenius considered that 
the number of species (S) is related to area (A) through a power law 
form [11] (called the power-function), i.e. S = SoA- (or 
logS= logc + zlogA), where So represents the number of species 
in a unit area [A = l) and 0<z< 1. Due to the random nature of 
the sampled data, statistical modeling is more suitable for SAR 
description than the deterministic approach [6], therefore, many 
statistical models have been developed (e.g. [2]). Moreover, 
statistical models can be thought of as general cases of 
deterministic models, because the mean value of the random 
variable of interest yields the results of the deterministic model. 



Because there exist many models to address the SAR (e.g. 
[3,9,12]), a natural question is how to select the best model for a 
given data set. To address this issue, here we integrate different 
models within a common framework and consider the problem of 
curve fitting by the transformed generalized linear model (TGLM) 
[13]. We propose the use of the generalized species-area model 
(GSAM) to describe the SAR. GSAM includes many models, such 
as those described in [3,9] as special cases. We also consider a 
model that simulates the colonization process of a region by 
different species and show that the GSMA has best fitted the data 
in comparison with traditional power-curve models. Finally, we 
use the data on the cumulative species richness of parasitic 
Hymenoptera from 25 nested plots in a beech forest on limestone 
[14]. Our results show that the GSAM can determine the best 
model for the data and estimate the number of species accurately. 

Methods 

Generalized species-area models 

In species-area curves, the number of species (S) is the 
dependent variable and the area (A) is the explanatory variable. 
Some mathematical models of SAR propose that the number of 
species is related to the area as 

Si = n{Ai,P), i=\,...,n, (1) 

where jS is a ^-parameter vector [1]. Function ^ can be derived 
from laws governing the physical system that gave rise to the data. 
As such models are deterministic and the properties related to the 
random nature of variable S are neglected, the deterministic 
models are often inadequate due to the stochastic nature of the 
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data [6]. The statistical modeling usually assumes that Eq. 1 can 
be written as 

Si = KAi,P) + ei, (2) 

where {s,} are independent and identically distributed random 
noises (i.i.d) — usually, s, ~ N(0,a^). Note that the mode in Eq. 2 
is a generalization of the determinic model, i.e. E(Si) = jx(Ai,P). 
The model given by Eq. 1 can be understood as a particular case 
of the model £{ A(5',-,i)} = fi(Ai,fi), where A( • • • ,X) is a mono- 
tonic transformation and A is a scalar parameter defining such a 
transformation. For instance, in cases whose data suggested by Eq. 
2 are unsatisfactory, the experimenter can assume a model with 
logarithmic transformation, i.e. 



log(5',) = M^,-,/?) + 6,. (3) 

This paper proposes a new model called generalized species-area 
model (GSAM), which is based on the TGLM approach proposed 
in [13]. The GSAM works with a general parametric family of 
transformations from the dependent \ ariable S to S'^' = A(S; l) 
and postulates that the transformed random variable S*^' follows a 
continuous probability distribution belonging to the exponential 
family. Furthermore, the GSAM assumes that there exists some A 
value such that 5^'*' satisfies the usual assumptions of the 
generalized linear models (GLM) [15]. 

A suitable choice of the family of transformations enables the 
representation of power-curves, their recent extensions (see 
[11,16-19]), the models presented in [2,3,9] and the logarithmic 
model described in [16] as special cases of the GSAM. We have 
considered the Box-Cox power transformation [20], which is 
effective at turning skewed unimodal distributions into nearly 
symmetric normal-like distributions. 

Let S = {si,. . .s„Y be the vector of observations. By using 

{5^ — 1 
log(S) ifA = 0, 

we can obtain the transformed observations 5*^' = . . . jj/')' . 
The GSAM assumes that there exists some A value such that the 

transformed random variables {^'J'^', . . . 5*^^'} can be considered 

independently distributed. Each S*'' follows an exponential family 
distribution with a probability density function of the form 

n{sf;ei4)= Q^v[r'{sfei-bm]+c{sfs], (5) 

where h(x) and c(x,0) are appropriate known functions. The 
dispersion parameter ^ is assumed to be the same for all 
observations. The mean and variance of are, respectively, 
E{Sf^} = lii = db(di)lde, and Fflr{5'P} = (^F(/i,), where 
ViHi) = (fb{ei)lde] = dUildOi is the variance function. Parameter 
Qi = J V^^{iij)dHj = q{iij) is a known one-to-one function of jU,. 
The GSAM also considers a systematic component given by 

giH-) = tj- = xlfi, (6) 

where the link function g{-) is a known one-to-one continuously 
differentiable function and x,- is a specified vector (p x 1) of the 



explained variables, which include the area, known functions of 
the area, and other environmental variables. Matrix X whose rows 
are vectors xf, «= 1, ...,«, is a specified nxp model matrix of fuU 
rank p<n and p^ifii, . . . ,Pp)^ is a set of unknown linear 
parameters to be estimated. The Unk function is assumed to be 
monotonic and differentiable. 

The GSAM proposed here considers three components of 
structural importance: (i) the Box-Cox family of transformations 
(Eq. 4) in association with a more general form for the distribution 
of the transformed variable 5^^' (Eq. 5); (ii) a linear predictor 
function and (iii) a possible nonlinear link function for the 
regression parameters (Eq. 6). Moreover, when the variance 
function V(jii) is not constant, i.e. when the variance is correlated 
with mean /j, some distributions of the exponential family enable 
the handling of data presenting heteroscedasticity. In this context, 
GSAM is a generalization of the previous mathematical models 
that describe the SAR. 

Species-area relationship models. Many models have 
been proposed for the description of the SAR and some can be 
linearized by a logarithmic transformation of the response variable 
(i.e. diversity of species). Models that are special cases of the 
GSAM have the following properties: (i) the transformation 
parameter in Eq. 4 is A = 0, i.e. a log-transformation is adopted, 
5'<'"= logC^) and // = £{S*°>}=£{log(5')} or no transformation 
is considered for variable S, i.e. we assume 1=1 and ji = E{S}\ 
(ii) the distribution in Eq. 5 is the normal distribution; and (iii) the 
link function is the identity function, g(p.) = /i and the systematic 
component in Eq. 6 is given by /i = Xfi. The elements of matrix X 
may be area A, \og(A) or additional variables, as in [12]. Very 
simple forms of the systematic component are given by 

= + log a or /< = jSo -I- ji^A. This special case of the GSAM 
can be understood as the particular cases proposed in [11,16]. A 
list of some models of SAR is provided. 

1. Considering the stochastic nature of variable S, which 
represents the number of species, the power model proposed 
by Arrhenius [1 1] can be written as, 



£{S} = M*i- (7) 
The logarithm of variable S yields 

E{MS)} = (ia + lh\n(A), (8) 

where the parameters of the power-curve in log-log space are 
)So= ln(6o) and Pi=b\. 

2. The persistence model (PI -full) proposed by Plotkin et al. [17] 
is given by 

i?{5}=M*iexp||]6fc+i^*=|, (9) 
or, considering the logarithm of variable S, 

£{ln(S)} = ;8o + j6iln(^)+ (10) 

k=\ 

where /jQ=ln(^)o), P\=b\ and l^k+i=bk+i, k=l, . . . ,n. A 
special case, when n= \ (PI model [17]) is given by 
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Figure 1. Data and fitted curve obtained by tKie GSiVIA model for the simulated species-area relationship. 

doi:1 0.1 371 /journal.pone.01 051 32.g001 



£{S} = M''i exp{62^}, 
or, considering the logarithm of variable S, 

E{\n(S)} = lio + fh MA) + M 



(11) 



(12) 



where /?o= In(Ao), P\ = b\ and ^2 = ^2- 
3. The persistence model proposed by Ulrich and Buszko [18,19] 
is given by 



£{S} = /?o^*i exp 



67 



or, considering the logarithm of variable S, 



£{ln(S}) = /Jo + /ii ln(^)+^, 



(13) 



(14) 



where /?o= In(Ao), P\=b\ and ^2=^2- 
4. The polynomial power-function model proposed by Chiarucci 
et al. [21] is defined as 



£{S} = lo£/^=o**-^'0, 
or, considering the logarithm of variable S, 



E{\MS)}=Y,hA\ 



(15) 



(16) 



where yS/^- = In (10**), /: = 0,1, . . . The quadratic power- 
function model proposed in [21] considers n = 2, i.e. 

E{S} = 10*''o+*i '°s<-^)+*2('°g(-^))^i"^ (17) 
or, considering the logarithm of variable S, 

E{\n (S)} = Po + Pi log (A) + 1^2 (log (A))-, (18) 
where = In (10**), k = 0,l,2- 

Some species-area relationships may also be represented by 
linear functions, specifically: 
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Table 1. Some of the traditional models adjusted with Gaussian errors. 



fi = E{S} 


Selection Criteria 










IVIUUCI3 


AlC 


BIC 




MSE 


MAPE 


[4] 


406.59 


424.09 


400.59 


0.608 


0.305 


[16] 


297.87 


315.34 


291.87 


0.069 


0.086 


[22] 


228.63 


251 .93 


220.63 


0.016 


0.050 


H = E[log(S)} 


Selection Criteria 










Models 


AlC 


BIC 


-llogL 


MSE 


MAPE 


[11] 


405.91 


423.38 


399.91 


0.340 


0.155 


[17]) 


358.24 


381.54 


350.25 


0.123 


0.091 


[18] 


331.76 


355.05 


323.76 


0.082 


0.071 


[21] 


221.09 


244.39 


213.09 


0.010 


0.025 



doi:10.1371/journal.pone.0105132.t001 

5. Linear model propo.sed by Mac Arthur and Wilson [4] 

E{S}=bo + biA. 

6. Logarithmic function proposed by Gleason [16] 

E{S} = P, + liiln{A). 



(19) 



(20) 



7. Quadratic logarithmic function proposed by Gitay et al. [22] 



E{S} = {ho + hi \n{A)y, 



(21) 



E{S} = /io + /?! In (A) + p2 (In (Af, (22) 
where 1^0 = ^0' P\=^bob\ and P2=b\ 

8. General power-logarithmic function proposed by Gitay et al. 
[22] 



If 62 is any real number and |fto| > |6i In (^)|, from Newton's 
generalized binomial expansion we obtain 



{Z>o + Z.iln(^)}"2 = (M"2E(t)(Q (In(^)f. (24) 

where the binomial coefficients with an arbitrary upper index can 
be defined as 



bi\ b2{b2-\)---{b2-k+\)) {b2)k 



k I k\ k\ 

Therefore, the logarithmic function model can be written as 



(25) 



E{S} = |k+Y.Pk'^^''(^)t^ (26) 

k=\ 

where iSo = (*o)*2, li, = {bop{(b2)Jk\}{bUb,)'. 

Full-scale generalized species-area relationship 

model. The right side of all equations presented in the previous 
section always involves polynomial terms, such as In {A), A and/ or 
\/A. Here, we propose a generalization of these models by 
considering the right side of the full-scale model consists of three 
polynomials, i.e. 



£{5} = {6o + 6iln(^)P. (23) 
Table 2. The GSAMs fitted with different models according to the likelihood method 



Pi(^)=X^ft,(ln(^))^ 

k=\ 



(27) 





Models 


Sif) 


Systematic component 




normal 




/?„ + /!, \MA) + h(\^Af + hlA 




gamma 


1 


+ \n{A) + jh(\nAf + jl,IA- 




I.G. 


1 


A + ft \MA) + ^,{\nA)- + fSJA- 





doi:l 0.1 371 /journal.pone.Ol 051 32.t002 
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Table 3. Selection criteria for the GSAMs fitted witli A adjusted according to the likelihood method. 



I^Q^gl Parameter Selection Criteria 



i AIC BIC -2logL MSE MAPE 



normal 


0.822(0.729; 0.922) 


79.90 


109.02 


69.90 


0.08 


0.61 


gamma 


0.328( -0.003,0.680) 


116.82 


151.76 


104.82 


0.09 


0.64 


I.G. 


-0.089(-0. 322,0. 141) 


142.87 


177.82 


130.88 


0.11 


0.68 



doi:l 0.1 371/journal.pone.01 051 32.t003 



P2(A)= J2 Pk 



k = m+\ 



k-ii 



(28) 



and 



P3(A)= J2 Pi'A'- 



(m+n) 



(29) 



The left sides of those equations have the number of species [S) or 
In (S). In order to generahze them, we have assumed that 5 is a 
random variable and considered the Box-Cox power transforma- 
tion (Eq. 4). 

The curves defined by the GSAM assume a linear predictor 
fimction and a nonlinear link function g(/x) for the systematic 
component (Eq. 6). The systematic component of the GSAM is 
given by 



A.-=l 



m n / 1 \ ' 



(30) 



k=\ 



The GSAM has other models as special cases. For instance, the 
persistence function, P2-fiill model, is a special case of GSMA if we 
consider the identity hnk: function g(f.i) = fi, where /.i = E{\og (S)}, i.e. 



/; = /?o + i5ihi(^)+£ft,^ 



/f=i 



(31) 



The persistence model, P2-full, can be written as 



A< = /?o+^/S,(ln(^))*+^/i„,+,(-) , (32) 

A-=l /f=l 



Although the model defined by Eq. 30 has a large number of 
parameters (theoretically, it can have an infinite number of 
parameters), in practice the fitted models have no more than six 
parameters. The advantage of such a model is that it enables the 
formulation of hypothesis testing for the choice of the parameters 
to be removed from those that are significant for better describing 
the SAR. 

Model fitting. The parameters to be estimated in the GSAM 
are /, P and <j) (Eqs. 4 and 6). In order to obtain maximum 
likelihood estimates for the vector of parameters jS and dispersion 
parameter (j>, we have defined a profiled likelihood function for X 
and used the same algorithm proposed in [13]. By assuming the 
model given by Eq. 5, the log-likelihood function for the vector of 



the transformed observations S' 



,^J/')^ can be written 



+ Y\'{sf\<i>n\og{j{x,sd}\, 



(33) 



where 9, = J V ^{f^Lj)di.ij = q(i.Lj) and J(A,Si) = \Sif'^ 
/= 1, . . . ,« is the Jacobian of the transformation from S to 5''''. 

The procedure described in [13] is used for making inferences 
about parameters {fi,(j)) first assuming that X is fixed and obtains 
the log-likelihood equations for estimating yff'''' and (j>^'K The 
maximum likelihood estimates (MLE) of yff, ^ and (j) for a given X 
are denoted by yff'-", fi^'-^ = X'pf-'\ fi^'^^i =g-\if'-^) and 
respectively. can be calculated, without knowledge on (^'^\ 
adjusting the GSAM (Eqs. 5-6) to 5*^* by iteration. 

The iteration starts with an initial set of values k=\, 



used to evaluate W 



and 



Mik) 



where 



diaglw' 



} is a diagonal matrix with weights 



(34) 



where ji^ = In(foo), = b\ and P^^i = bk+\, k=l, . . . ,n. 



Table 4. Normal GSAM model fitted by the systematic component shown in Table 2. 







Po 


^1 


h 


ft 


Coefficients 


-10.0405 


14.1548 


-0.9251 


15.3104 


(SD) 


(0.4593) 


(0.1713) 


(0.0154) 


(0.5575) 



doi:10.1371/journal.pone.0105132.t004 
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and 2*''"*^' = (Zj'''"''', . . . ,2*,'''"'^')"'^ is a working vector whose 
components are given by 



VD >— C\ 



The next estimate ^*^"*+'' can be obtained by 



(35) 



(36) 



ro m m m 



<— T— ^ ^ fN 



\0 yO ^ \D CO 



This new value is used to update <''><''^+'> and 2*'*'**+'' and the 
procedures are repeated until convergence has been achieved. 

Estimating parameter i^'''' is more difficult than estimating 
In principle, (j)^^^ could also be estimated by maximum likelihood, 
although there may be practical difficulties associated with this task 
for some members of Eq. 5. Details about the technique used for 
finding the MLE for a fixed ). can be found in [13]. 

In order to obtain the MLE 1, we replace MLE and 0*'^' in 
(33), which results in the profile log-likelUiood function 
lp(A) = l(fi'^\(j>^'-\X). The plot of the profile likelihood function 
lp{/.) against /. for a sequence of values of X numerically 
determines the MLE for 1. Once the MLE for 2. has been 
obtained, it can be used to produce the unrestricted estimates 

^=^* and^ = 0<^>. 

Assuming that the estimated A is known, the confidence 
intervals for parameters yff*'"' and (^*''' can be calculated in the 
usual context of the GLM and using the adjusted values and 
0*'"'. We consider the approximate covariance matrix oi p^^^ and 

the variance of given in [13] to make inferences about these 
parameters. Here, we have considered the gamma, Gaussian and 
inverse Gaussian distributions for the probability density function 
(Eq. 5) 

We also performed likelihood ratio (LR) tests [23] using a 
statistic w = 2{/p(l) — which has an asymptotic Xi 
distribution for testing /I = and constructed a large sample 
confidence interval for 1 by inverting the LR test. 

Results and Discussion 

Simulation of tfie colonization process of a region 

The parameters of SAR curves are determined from the survey 
data. As a proof of concept we first used simulated data for 80 
species placed in a 50 x 50 cell lattice according to a neutral model 
[24] without dispersal limitation, as applied in [25]. This lattice 
was then resampled so as to establish the shape of the SAR 
(Figure 1). 

The adjusted models are variations of the full-scale model with 
six parameters, given by 



gU^) = Po + Pi ln(^) + fc(ln^)' 
+ li,/A + lijA' + P,A + p,A\ 



(37) 



The following canonical link functions were considered: (i) 
gifi) = /( for the normal GSA\'I, (ii) g{p.) = l/n for the gamma 
GSAM and (iii) gi^i)=l/fi^ for the inverse Gaussian GSAM. 
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Figure 2. Data and fitted model for the real species-area relationship of Hymenoptera in a beech forest on limestone. 
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Moreover, the traditional models presented in the previous section 
were considered by assuming that the random variable S is 
normally distributed. We could estimate the mean of the 
transformed data = but to predict the expected value 

of the untransformed dependent variable S, when the GSAM is 
adjusted to the data, E{S) must be estimated. The dependent 
variable S can be explained by subtracting /( on both sides of Eq. 4 
and solving this equation for S. When A # 0 we can write 



5 = (ASW + l)i/^ = 



l+A^l' 



(38) 



The expected value of the species number S, on the original 
scale, can be evaluated by a first-order approximation of the 
binomial expansion (Eq. 38), as given in detail in [13]: 



E{S) = {\+lti)"'{ 1 



(39) 



The best model can be chosen by using the AIC and BIG 
criteria [26], which are measurements of the relative goodness of 
fit of a statistical model for a given set of data. The mean square 
error (MSE) and mean absolute percent error (MAPE) are given, 
respectively, by 



Table 6. Normal GSAM model fitted to SAR of Hymenoptera. 









/*! 


Pi 


Coefficients 


101.9707 


20.6625 


-46.1669 


(SD) 


(9.4348) 


(0.8211) 


(22.3287) 



doi:l 0.1 371 /journal.pone.01 051 32.t006 
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MSE=^-^±is,-kf (40) 
^ 1=1 

and 

" i=l 

where is the sample variance of S and S, is the estimate of 5, 
given by Eq. 39. 

Table 1 shows some of the traditional models adjusted with 
normal error. The selerted model was the logarithmic quadratic 
function proposed by [21], with minimum AIC = 221.09, 
BIC = 244.39, MSE=l.02% and MAPE = 2.SQ%. 

Table 2 shows the selected models adjusted by the GSAM with 
normal, gamma and inverse Gaussian (I.G.) distributions. Note 
that the adjusted models are variations of the full-scale model. 

Table 3 shows the GSAM models fitted with A adjusted by the 
profile likelihood. The selected normal GSAM has minimum 
AIC = 19.90 and 5/C= 109.02, which are die lowest values 
among the adjusted models. MSE and MAPE of the model are 
also smaller than those of the adjusted model with gamma and 
inverse Gaussian distributions. The adjusted value of parameter A 
is A = 0.822 with confidence interval (0.729; 0.922). Because X is 
different from zero or one, there is a significant difference between 
the results achieved with this model or by using the traditional 
models given in Table 1. Therefore, for this data set, the normal 
GSAM is the model that has best fitted the analyzed data. The 
MLE estimates of the systematic component and standard- 
deviation (SD) of the systematic component are shown in Table 4. 

Figure 1 shows the systematic observation of SAR on the 
original scale and the fitted curve with the adjusted GSMA 
models. E{S) was calculated by Eq. 39 and for the adjusted 
GSAM we obtained MSE = 0.0%% and MAPE = 0.6\%, respec- 
tively. 

Application to real data 

The GSAM model was applied to a data set that consisted of 25 
observations of parasitic insects of the Hymenoplem order in a 
beech forest on limestone. Hymenoptera is one of the largest orders 
of insects that comprise sawflies, wasps, bees and ants. The total 
number of Hymenopteran species in Europe exceeds 20,000. The 
data considered here contain the summary of a long-term study of 
the ecology of parasitic Hymenoptera in a German beech forest, 
i.e. the Gcittingen forest, which is approximately 120 years old and 
has grown over a ground limestone. The climate of the forest is 
typical of Central Europe and the work area covered approxi- 
mately four acres. The study was conducted for 8 years (starting in 
1980) in 144 square meters of forest soil. 

The analysis of the SAR for Hymenoptera is essential, because 
the insects that belong to this order are the most important 
environmental agents fundamental for nutrient recycling and 
control of harmful species. The group is ubiquitous and it is 
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